multi+refactor: persistent peer manager #5700
Conversation
Visit https://dashboard.github.orijtech.com?back=0&pr=5700&remote=true&repo=ellemouton%2Flnd to see benchmark details.
Force-pushed from 6998d8a to 601ffa2
Force-pushed from 601ffa2 to fccacb9
cACK, I like the direction where we start moving code out of the server.go, def involves more work tho. My question is, since we are already here, can we move all connection management into peer?
Mind elaborating? Do you mean in general or specifically in this PR?
Very happy to do so! Do you think it should all be in 1 PR?
Meant in this PR. Since we are moving the code outside of server.go.
It depends. I'd think about how urgent issue #5377 is. If it's a pressing issue, I'd suggest we move to #5538 and land it first, since it's almost done and has been reviewed. If not, we might as well take the chance to refactor the connection manager here, moving it into its own package. This will def take more time as the scale is large, but we should do it imo. Another option is to continue what you have here, so that we move out the persistent manager as the first step. Guess it's a judgment call from @Roasbeef.
Force-pushed from 155c941 to 9f1d37c
Force-pushed from 363859c to 526f660
Force-pushed from 24a11d5 to a618cc8
I'd argue it is pressing. I still have a handful of Tor peers that refuse to connect back to me even after almost 4 months. Since many nodes have IP address changes once in a while, I assume they have the same issue without ever noticing it.
Holding off on this PR until 0.15. Working on the solution in #5538 again instead 👍
I'd also argue it's quite pressing, I have several users of my wallet who have encountered this issue.
Force-pushed from 4c4b743 to f1f7bd8
This commit moves the logic in the main server's cancelConnReqs function over to the PersistentPeerManager and adds a unit test for it.
In this commit, we take note of the fact that PersistentPeerManager.DelPeer(pubKey) is always preceded by PersistentPeerManager.CancelConnReqs(pubKey). And so DelPeer is adapted to always cancel the peer's connReqs.
Since the PersistentPeerManager will always look up persisted advertised addresses for any peer we add to it, there is no need to do this outside the manager.
In order to prevent a situation where an old address map is used for connection request creation, always cancel any previous retry canceller before initialising a new one.
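For concreteness, a minimal sketch of the cancel-before-reinit pattern that last commit describes (the package, type and field names are assumptions for illustration, not the PR's exact code):

```go
package peermgr

// persistentPeer is a trimmed-down stand-in for the manager's per-peer state;
// the field name here is an assumption for illustration only.
type persistentPeer struct {
	retryCanceller chan struct{}
}

// resetRetryCanceller cancels any retry attempts that are still maturing and
// hands back a fresh cancel channel, so an old retry can never fire with a
// stale address map.
func (p *persistentPeer) resetRetryCanceller() chan struct{} {
	if p.retryCanceller != nil {
		close(p.retryCanceller)
	}

	p.retryCanceller = make(chan struct{})
	return p.retryCanceller
}
```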
Will update to 0.16 once the file for it is in the code base.
Force-pushed from ac5eebf to 077d9ca
Ok, this is finally ready for re-review :) I have restructured the commits significantly, which will hopefully make review easier.
checkpointing review here
"sync" | ||
|
||
"github.com/btcsuite/btcd/btcec/v2" | ||
"github.com/lightningnetwork/lnd/routing/route" |
I don't think route.Vertex needs to be introduced
@@ -359,7 +356,9 @@ func (s *server) updatePersistentPeerAddrs() error {
	// We only care about updates from
	// our persistentPeers.
	s.mu.RLock()
	_, ok := s.persistentPeers[pubKeyStr]
I think it's possible to remove the server mutex (un)locking calls, since those seem to be handled by the new manager. Though maybe it's best to not change any assumptions here and keep it how you did it
relaxedBackoff := computeNextBackoff(currentBackoff, maxBackoff)
relaxedBackoff -= connDuration

if relaxedBackoff > maxBackoff {
should be minBackoff
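A hedged sketch of what the fixed clamp could look like. The helper below is a simplified stand-in for the existing computeNextBackoff, and the wrapping function name is illustrative:

```go
package peermgr

import "time"

// computeNextBackoff is a simplified stand-in for the existing helper: double
// the current backoff and cap it at maxBackoff (the real helper also adds
// randomised jitter).
func computeNextBackoff(currentBackoff, maxBackoff time.Duration) time.Duration {
	next := currentBackoff * 2
	if next > maxBackoff {
		return maxBackoff
	}
	return next
}

// nextPeerBackoff shows the comparison against the *minimum* backoff: after
// subtracting the previous connection's duration, the relaxed value is only
// used if it is still above the floor.
func nextPeerBackoff(currentBackoff, connDuration, minBackoff,
	maxBackoff time.Duration) time.Duration {

	relaxedBackoff := computeNextBackoff(currentBackoff, maxBackoff)
	relaxedBackoff -= connDuration

	if relaxedBackoff > minBackoff {
		return relaxedBackoff
	}

	// The relaxed backoff dropped below the floor, so clamp it.
	return minBackoff
}
```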
// retryCanceller is used to cancel any retry attempts with backoffs
// that are still maturing.
retryCanceller *chan struct{}
I think you could use a regular channel and a boolean that, when set, means that the channel was closed. They both achieve the same thing, but I think the bool approach is nicer
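A rough illustration of that suggestion (the names here are made up, not the PR's):

```go
package peermgr

import "sync"

// retryState sketches the suggested shape: a plain channel plus a flag that
// records whether it has already been closed.
type retryState struct {
	mu        sync.Mutex
	cancel    chan struct{}
	cancelled bool
}

// cancelRetries closes the channel exactly once; the boolean makes repeated
// cancellations safe without needing a pointer-to-channel.
func (r *retryState) cancelRetries() {
	r.mu.Lock()
	defer r.mu.Unlock()

	if r.cancelled {
		return
	}

	close(r.cancel)
	r.cancelled = true
}

// resetRetries installs a fresh channel for the next round of retries.
func (r *retryState) resetRetries() chan struct{} {
	r.mu.Lock()
	defer r.mu.Unlock()

	r.cancel = make(chan struct{})
	r.cancelled = false
	return r.cancel
}
```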
Did another pass of this PR focusing on concurrency issues first. I have to say that I find it hard to be certain that there are none with the extensive use of the lock, goroutines and cancel channel.
After my previous review, I mentioned:
I wonder if it would be easier to understand with a single event loop per peer that receives updates via a channel.
I still think that would help a lot. Not just for review, but also for future devs working on this code.
Generally speaking, the connection handling for a peer is independent of all other peers. If that would be reflected in the code as a layer, it's already a bit easier to see how things work.
Then within the code for handling a single peer, I think that a single event loop that deals with retries, new addresses, backoff and additional connection requests will make the logic much more transparent.
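Roughly what such a per-peer event loop could look like; the event set and names below are guesses for illustration only:

```go
package peermgr

import "net"

// peerEvent is one update affecting a single peer's connection handling.
type peerEvent struct {
	newAddrs []net.Addr
	retry    bool
}

// peerLoop serialises all retries, address updates and backoff decisions for
// one peer through a single goroutine, so the per-peer state needs no mutex
// and no shared cancel channel.
func peerLoop(events <-chan peerEvent, quit <-chan struct{},
	connect func([]net.Addr)) {

	var addrs []net.Addr

	for {
		select {
		case ev := <-events:
			if len(ev.newAddrs) > 0 {
				addrs = ev.newAddrs
			}
			if ev.retry {
				connect(addrs)
			}

		case <-quit:
			return
		}
	}
}
```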
peerKey := route.NewVertex(pubKey)

// Fetch any stored addresses we may have for this peer.
advertisedAddrs, err := m.cfg.FetchNodeAdvertisedAddrs(peerKey)
Is it possible to miss a graph update in between this fetch and adding the peer to the list of connections? Maybe safer to swap the order?
i think it's safer to add it to the map and then update it with the fetched addrs
otherwise i think it's possible that:
- we call ConnectPeer in the server
- AddPeer is called and this ends up racing
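A compressed, self-contained sketch of that safer ordering (the types and names below are stand-ins, not the manager's real API): register the peer first, then fetch and append.

```go
package peermgr

import (
	"net"
	"sync"
)

// manager is a stand-in for the PersistentPeerManager in this sketch.
type manager struct {
	mu    sync.Mutex
	conns map[string][]net.Addr

	// fetchAddrs stands in for cfg.FetchNodeAdvertisedAddrs.
	fetchAddrs func(pubKey string) ([]net.Addr, error)
}

// addPeer registers the peer *before* fetching its advertised addresses, so a
// concurrent graph update (or ConnectPeer call) always finds an entry, and
// the later fetch is appended rather than overwriting whatever arrived in the
// meantime.
func (m *manager) addPeer(pubKey string) error {
	m.mu.Lock()
	if _, ok := m.conns[pubKey]; !ok {
		m.conns[pubKey] = nil
	}
	m.mu.Unlock()

	addrs, err := m.fetchAddrs(pubKey)
	if err != nil {
		return err
	}

	m.mu.Lock()
	m.conns[pubKey] = append(m.conns[pubKey], addrs...)
	m.mu.Unlock()

	return nil
}
```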
backoff := m.cfg.MinBackoff
if peer, ok := m.conns[peerKey]; ok {
	backoff = peer.backoff
If the peer is already being tracked, does the fetch call above still add something? Perhaps setPeerAddrsUnsafe can just append new addrs?
// connection requests for. So create new connection requests for those.
// If there is more than one address in the address map, stagger the
// creation of the connection requests for those.
go func() {
Does this need to be added to the waitgroup for when Stop is called? The actual Connect call is async too.
This is pre-existing, I guess because of this comment which I don't entirely understand:
Lines 4039 to 4041 in 22fec76
// We choose not to wait group this go routine since the Connect
// call can stall for arbitrarily long if we shutdown while an
// outbound connection attempt is being made.
// We choose not to wait group this go routine since the Connect call
// can stall for arbitrarily long if we shutdown while an outbound
// connection attempt is being made.
Wondering why this is. Is there a timeout missing?
// Next, check to see if we have any outstanding persistent connection
// requests to this peer. If so, then we'll remove all of these
// connection requests, and also delete the entry from the map.
if len(peer.connReqs) == 0 {
I think at this point, there may still be a ConnectPeer goroutine blocked on obtaining the lock. After this function returns and unblocks, ConnectPeer may still add something to the connReqs?
Yes this seems possible, if the goroutine spawned by ConnectPeer doesn't get canceled by the chan and is waiting on the mutex lock
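One way to close that window, sketched with illustrative names: re-check the cancel channel after the lock is finally acquired, so a ConnectPeer goroutine that lost the race to CancelConnReqs/DelPeer becomes a no-op.

```go
package peermgr

import "sync"

// connReq is a placeholder for connmgr.ConnReq in this sketch.
type connReq struct{}

// trackedPeer holds the per-peer connection requests guarded by a mutex.
type trackedPeer struct {
	mu       sync.Mutex
	connReqs []*connReq
}

// addConnReq records a new connection request, but only if the retry round
// has not been cancelled while the caller was blocked waiting for the lock.
func (p *trackedPeer) addConnReq(req *connReq,
	cancelChan <-chan struct{}) bool {

	p.mu.Lock()
	defer p.mu.Unlock()

	select {
	case <-cancelChan:
		// Cancellation won the race; drop the request so it cannot
		// leak past CancelConnReqs/DelPeer.
		return false
	default:
	}

	p.connReqs = append(p.connReqs, req)
	return true
}
```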
FetchNodeAdvertisedAddrs func(route.Vertex) ([]net.Addr, error)
}

// PersistentPeerManager manages persistent peers.
Maybe add a basic description of what is/makes a peer persistent?
}

peer.connReqs = updatedConnReqs
cancelChan := peer.getRetryCanceller()
I think there may be a race condition with running goroutines created by the previous ConnectPeer that are blocked on the mutex?
with peer.connReqs?
// exist.
cancelChan := peer.getRetryCanceller()

// We choose not to wait group this go routine since the Connect call
Isn't the ConnectPeer call doing its connects in a goroutine?
break
}

// If we are, the peer's address won't be known
i think this changes the behavior since now the peermgr will attempt to connect to the incoming address - before this change, the attempt wasn't made.
also, the original logic is slightly off. If we get an incoming connection and don't have any other addresses for the peer, we'll attempt to connect to the addr+port, but they may not actually be listening on the port given how tcp works. The issue seems to be that when we create a link node, we'll use the remote address as the address to store and expect to be able to connect to it
ticker := time.NewTicker(m.cfg.MultiAddrConnectionStagger)
defer ticker.Stop()

for _, addr := range addrMap {
doesn't seem like a big deal, but it's possible that the previous goroutine adds to connReqs even though the cancelChan is closed
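For illustration, the staggered loop with an explicit cancellation check before each attempt. The callback and names are assumptions, and this alone doesn't remove the window entirely, since the connect itself can still register a request after cancellation:

```go
package peermgr

import (
	"net"
	"time"
)

// connectStaggered attempts each address in turn, waiting `stagger` between
// attempts, and bails out as soon as the cancel channel is closed, so that no
// further attempts are started after cancellation.
func connectStaggered(addrs []net.Addr, stagger time.Duration,
	cancelChan <-chan struct{}, connect func(net.Addr)) {

	ticker := time.NewTicker(stagger)
	defer ticker.Stop()

	for _, addr := range addrs {
		// Re-check cancellation right before issuing the request, not
		// only while waiting on the ticker.
		select {
		case <-cancelChan:
			return
		default:
		}

		connect(addr)

		select {
		case <-ticker.C:
		case <-cancelChan:
			return
		}
	}
}
```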
@bhandras: review reminder
!lightninglabs-deploy mute
muting for a bit. Will pick this up again when my plate clears up a bit
closing for now. Can re-open once re-prio'd
This PR adds a PersistentPeerManager to the server and refactors all persistent peer logic to be handled by the PersistentPeerManager. The end result is that persistentPeers, persistentPeersBackoff, persistentConnReqs and persistentRetryCancels are all removed from the server struct and replaced by a PersistentPeerManager.